-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Map shared memory region as read-only in receiver #248
Conversation
89f4d4b
to
43091bd
Compare
This ensures that the data is not modified, even if shared with Python as zero-copy arrow array. Modifying the data could result in undefined behavior since there might be other subscribers that read the data at the same time.
43091bd
to
5ba0366
Compare
The maintainer of the `shared_memory` crate did not react to our PR, so I decided to create a (temporary) fork.
The shared memory image must not be modified because it might be sent to other receivers too. Trying to modify it will result in a SEGFAULT now that we map the shared memory region as read-only.
@haixuanTao I needed to add a This seems like a common footgun. Is there any way that we can prevent modification of shared memory data through our Python API? Arrow arrays are normally read-only, but it seems like this property either lost after the numpy conversion or ignored by |
Hi philipp, thanks for raising this issue. It seems to be an issue within the As a first inspection, within the Cython code of pyarrow the function def to_numpy(self, zero_copy_only=True, writable=False):
"""
Return a NumPy view or copy of this array (experimental).
By default, tries to return a view of this array. This is only
supported for primitive arrays with the same memory layout as NumPy
(i.e. integers, floating point, ..) and without any nulls.
For the extension arrays, this method simply delegates to the
underlying storage array.
Parameters
----------
zero_copy_only : bool, default True
If True, an exception will be raised if the conversion to a numpy
array would require copying the underlying data (e.g. in presence
of nulls, or for non-primitive types).
writable : bool, default False
For numpy arrays created with zero copy (view on the Arrow data),
the resulting array is not writable (Arrow data is immutable).
By setting this to True, a copy of the array is made to ensure
it is writable.
Returns
-------
array : numpy.ndarray
"""
cdef:
PyObject* out
PandasOptions c_options
object values
if zero_copy_only and writable:
raise ValueError(
"Cannot return a writable array if asking for zero-copy")
# If there are nulls and the array is a DictionaryArray
# decoding the dictionary will make sure nulls are correctly handled.
# Decoding a dictionary does imply a copy by the way,
# so it can't be done if the user requested a zero_copy.
c_options.decode_dictionaries = not zero_copy_only
c_options.zero_copy_only = zero_copy_only
with nogil:
check_status(ConvertArrayToPandas(c_options, self.sp_array,
self, &out))
# wrap_array_output uses pandas to convert to Categorical, here
# always convert to numpy array without pandas dependency
array = PyObject_to_object(out)
if isinstance(array, dict):
array = np.take(array['dictionary'], array['indices'])
if writable and not array.flags.writeable:
# if the conversion already needed to a copy, writeable is True
array = array.copy()
return array Which by default consider that the array is not writable, but i'm not sure if the information is passed on to numpy, where the bug is raised. I'm going to investigate more. |
So it seems that the numpy array created by arrow_array.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : False
ALIGNED : True
WRITEBACKIFCOPY : False And assigning data into it doesn't work: >>> arrow_array[0]=1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: assignment destination is read-only But, it seems that cv2 doesn't take into account this flag. See the following example: import pyarrow as pa
import numpy as np
array = pa.array(np.zeros((100))).to_numpy().reshape((10,10))
assert array.flags.writeable == False, "Array should be not writable"
cv2.rectangle(array,(0,0),(5,5),(255),2)
print(array) array([[255., 255., 255., 255., 255., 255., 255., 0., 0., 0.],
[255., 255., 255., 255., 255., 255., 255., 0., 0., 0.],
[255., 255., 0., 0., 255., 255., 255., 0., 0., 0.],
[255., 255., 0., 0., 255., 255., 255., 0., 0., 0.],
[255., 255., 255., 255., 255., 255., 255., 0., 0., 0.],
[255., 255., 255., 255., 255., 255., 255., 0., 0., 0.],
[255., 255., 255., 255., 255., 255., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]) I'll raise the issue with the cv2 team |
I have opened the issue: opencv/opencv-python#859 |
The upstream issue was fixed by opencv/opencv#24026. So I guess we just need to wait for a new release now? |
you mean a opencv 4.9 release right? |
Yes, and a corresponding The question is whether we want to merge this PR anyway without waiting.A SEGFAULT is better than silent data corruption in my opinion... |
I'm ok, with merging this before the next opencv release that seems to not happen in the short term: https://github.com/opencv/opencv/milestone/54 . I think that ultimately, we need to provide the right documentation. But the opencv dependency isn't really a direct dependency of dora-rs. It's a "user" dependency. So we should not delay our planned feature. I will have to adapt code on dora-drives. |
This ensures that the data is not modified, even if shared with Python as zero-copy arrow array. Modifying the data could result in undefined behavior since there might be other subscribers that read the data at the same time. With this PR, a SEGFAULT will occur when attempting to modify shared memory data.
Blocked on elast0ny/shared_memory#100.I created a (temporary) fork of the crate with the nameshared_memory_extended
.